Introduction

One of the most challenging parts of making inferences from publicly available DNA sequences for metabarcoding experiments is obtaining a set of reference sequences with reliable and specific taxonomy information. If the locus being studied is a standard barcoding locus and the taxa of interest well known, then there is likely a curated database to get sequences from. The problem becomes more difficult with non-standard loci or taxa. Some examples of curated databases are shown below:

data <- data.frame(Database = c("UNITE", "RDP", "ITS1", "PR2", "SILVA", "Greengenes"),
                   Locus = c("ITS", "16S, 28S rRNA", "ITS1", "18S", "SSU, LSU", "16S"),
                   Taxon = c("Fungi", "Bacteria, Archea, Fungi", "Fungi",
                             "Protists", "All", "Bacteria, Archaea"))
knitr::kable(data)
Database Locus Taxon
UNITE ITS Fungi
RDP 16S, 28S rRNA Bacteria, Archea, Fungi
ITS1 ITS1 Fungi
PR2 18S Protists
SILVA SSU, LSU All
Greengenes 16S Bacteria, Archaea

Parsing taxonomy data with extract_taxonomy

There are a large number of sequence header/metadata formats and most of them are specific to a particular database. This makes it difficult to compare and combine data from diverse sources. Rather than exacerbating the syntactic pollution with another custom format or creating a parser for every database-specific format, the function extract_taxonomy parses taxonomy information from arbitrary characters strings (e.g. sequence headers) identified by a regex expression. Although the motivation for creating extract_taxonomy is to parse FASTA sequence headers, the function is not specific to FASTA or even to sequence information, so I will use the word “observation” instead of “sequence header” from now on. Any list of strings that contain identity and taxonomy information of a set of “observations” is valid.

The function communicates with online databases (principally implemented using taxize) to infer missing information from information supplied. For example, if a GenBank accession number was the only information available, the taxon id and classification would be retrieved from the NCBI databases. At the minimum, the output of extract_taxonomy consists of unique taxon identifiers, and the tree structure of the taxonomy shared by all sequences. This is all the information needed to fully characterize the taxonomic classification of a set of sequences.

Input

There are three important arguments that will usually be relevant: regex, key, and database.

  • The regex argument defines the structure of the strings and the location of relevant information (e.g. GenBank accession numbers) via regex capture groups (i.e. pairs of unescaped parentheses).
  • The key argument defines what kind of information is in each capture group, determining how it will be parsed. The elements of key has a defined set of possible values; see the extract_taxonomy help page for options.
  • The database argument determines the online taxonomy database that will be queried when necessary. Usually, this is ncbi, but others databases are possible (although not as well tested yet); see the extract_taxonomy help page for options.

Other arguments are important in special cases. Some will be explained in the examples below.

Output

An object of type taxmap is returned. See ?taxmap for more details.

Parsing FASTA files

FASTA files are one of the most common formats that sometimes contain taxonomic data. The read.FASTA function from the ape (Paradis Claude, et al., 2004) package is commonly used to parse FASTA files in R and extract_taxonomy recognizes its output. When the output of ape::read.FASTA is given to extract_taxonomy, the headers are parsed and the sequences are saved in the output object.

Genbank FASTA headers

FASTA files downloaded from GenBank custom queries contain the GenBank id and the accession number/version to identify sequences. Taxonomic information can be retrieved using either of these identifiers. The following shows how to extract the GI numbers, accession id, and description from the headers. The GenBank accession number is being used to look up the taxonomy information (hence "obs_id" in key option), while the GI numbers and description are being returned without contributing taxonomic information (hence "obs_info" in key option).

library(metacoder)
file_path <- system.file("extdata", "ncbi_basidiomycetes.fasta", package = "metacoder")
sequences <- ape::read.FASTA(file_path)
genbank_ex_data <- extract_taxonomy(sequences,
                                    regex = "^.*\\|(.*)\\|.*\\|(.*)\\|(.*)$",
                                    key = c(gi_no = "obs_info", "obs_id", desc = "obs_info"),
                                    database = "ncbi")

The sequence headers have the format:

[1] "gi|626494018|ref|NR_119774.1| Cortinarius cramesinus PDD 27173 ITS region; from TYPE material"        
[2] "gi|1028916741|ref|NR_137130.1| Lactarius perparvus GENT KW320 ITS region; from TYPE material"         
[3] "gi|1028916731|ref|NR_137120.1| Cortinarius elotoides IB 19870060 ITS region; from TYPE material"      
[4] "gi|1028916725|ref|NR_137114.1| Stomatisora psychotriicola PREM 60886 ITS region; from TYPE material"  
[5] "gi|1028916658|ref|NR_137047.1| Fomitiporia tenuis MUCL 44802 ITS region; from TYPE material"          
[6] "gi|1028916619|ref|NR_137008.1| Derxomyces qinlingensis CGMCC AS 2.2446 ITS region; from TYPE material"

We can plot the result using plot:

set.seed(2)
heat_tree(genbank_ex_data,
          node_size = n_obs,
          node_color = n_obs,
          node_label = name,
          layout = "fruchterman")

UNITE FASTA headers

The format of the UNITE (Kõljalg Larsson, et al., 2005) FASTA release has two pieces of information from which classifications can be determined. The GenBank sequence identifier can be used to look up the taxon id from GenBank. Alternatively, the classifications specified in the header can be used to make an arbitrarily coded taxonomy.

Using the sequence identifier

The GenBank accession number in the second entry of UNITE sequence headers can be used to look up the taxon assigned to each sequence by GenBank. Looking up the taxon assignment using the sequence accession number means changes in sequence taxonomy since the UNITE FASTA file was downloaded will be included. Therefore, the taxonomy returned by GenBank could be different than the one in the header. However, some of the sequences in the UNITE database do not have a GenBank accession number; these IDs start with UDB and should be filtered out:

file_path <- system.file("extdata", "unite_general_release.fasta", package = "metacoder")
sequences <- ape::read.FASTA(file_path)
unite_ex_data_1 <- extract_taxonomy(sequences[!grepl(pattern = "\\|UDB", names(sequences))],
                                    regex = "^(.*)\\|(.*)\\|(.*)\\|.*\\|(.*)$",
                                    key = c(seq_name = "obs_info", "obs_id",
                                            other_id = "obs_info", tax_string = "obs_info"),
                                    database = "ncbi")

The sequence headers have the format:

[1] "Lachnum_sp|JQ347180|SH189775.06FU|reps|k__Fungi;p__Ascomycota;c__Leotiomycetes;o__Helotiales;f__Hyaloscyphaceae;g__Lachnum;s__Lachnum_sp"                              
[2] "Lachnellula_calyciformis|U59145|SH189776.06FU|refs|k__Fungi;p__Ascomycota;c__Leotiomycetes;o__Helotiales;f__Hyaloscyphaceae;g__Lachnellula;s__Lachnellula_calyciformis"
[3] "Lachnum_sp|AM084756|SH189777.06FU|reps|k__Fungi;p__Ascomycota;c__Leotiomycetes;o__Helotiales;f__Hyaloscyphaceae;g__Lachnum;s__Lachnum_sp"                              
[4] "Lachnum_sp|FM172814|SH189778.06FU|reps|k__Fungi;p__Ascomycota;c__Leotiomycetes;o__Helotiales;f__Hyaloscyphaceae;g__Lachnum;s__Lachnum_sp"                              
[5] "Lachnum_sp|FN539058|SH189779.06FU|reps|k__Fungi;p__Ascomycota;c__Leotiomycetes;o__Helotiales;f__Hyaloscyphaceae;g__Lachnum;s__Lachnum_sp"                              
[6] "Lachnum_pulverulentum|AB481260|SH189780.06FU|refs|k__Fungi;p__Ascomycota;c__Leotiomycetes;o__Helotiales;f__Hyaloscyphaceae;g__Lachnum;s__Lachnum_pulverulentum"        
set.seed(10)
filter_taxa(unite_ex_data_1, name == "Fungi", subtaxa = TRUE) %>%
  heat_tree(node_size = n_obs,
            node_color = n_obs,
            node_label = name,
            layout = "davidson-harel",
            overlap_avoidance = 0.5)

Using included classification names to look up the taxon id

If you want to use the structure and names of the classification provided in the header, but still look up the official taxon id, you can provide "class" as the only key with taxonomic information. Since the UNITE classification also included the rank for each taxon, we will have to specify how to parse each taxon string using the class_regex and class_key options. These work the same as regex and key, except they apply to each element in a classification. You can still capture the sequence id or taxon id (assuming its present in the header) by using "obs_info" or "taxon_info" where you would otherwise use "obs_id" or "taxon_id".

file_path <- system.file("extdata", "unite_general_release.fasta", package = "metacoder")
sequences <- ape::read.FASTA(file_path)
unite_ex_data_2 <- extract_taxonomy(sequences,
                                    regex = "^(.*)\\|(.*)\\|(.*)\\|.*\\|(.*)$",
                                    key = c(seq_name = "obs_info", seq_id = "obs_info",
                                            other_id = "obs_info", "class"),
                                    class_regex = "^(.*)__(.*)$",
                                    class_key = c(unite_rank = "taxon_info", "name"),
                                    class_sep = ";",
                                    database = "ncbi")
heat_tree(unite_ex_data_2,
          node_size = n_obs,
          node_color = n_obs,
          node_label = name)

Note that the taxon name (entry 1) and the sequence id (entry 2) are now encoded as "obs_info", causing them to be interpreted as generic data. This means that only the classification string (e.g. k__Fungi;p__Ascomycota;c__Leotiomycetes) will be interpreted as having taxonomic information, but the other information will also be included in the output in columns named name and sequence_id. The unique taxon id for each taxon encountered will be looked up and taxa not found will be encoded as NA.

Using the included classifications to generate arbitrary ids

It is also possible to build a custom taxonomy encoding using the taxonomy in the sequence headers without looking up the official taxon ids of each taxon. Taxa will be assigned arbitrary taxon ids that will be specific to the current analysis. This is the method that is most useful if available, since it does not require an internet connection. To do this, leave off the database option from the previous example.

file_path <- system.file("extdata", "unite_general_release.fasta", package = "metacoder")
sequences <- ape::read.FASTA(file_path)
unite_ex_data_3 <- extract_taxonomy(sequences,
                                    regex = "^(.*)\\|(.*)\\|(.*)\\|.*\\|(.*)$",
                                    key = c(seq_name = "obs_info", seq_id = "obs_info",
                                            other_id = "obs_info", "class"),
                                    class_regex = "^(.*)__(.*)$",
                                    class_key = c(unite_rank = "taxon_info", "name"),
                                    class_sep = ";")
heat_tree(unite_ex_data_3,
          node_size = n_obs,
          node_color = n_obs,
          node_label = name)

ITS1 DB FASTA headers

The ITS1 database FASTA header includes the GenBank taxon id. In the example below, both the "obs_id" and "taxon_id" keys are provided, but only the "taxon_id" is used to look up taxonomy information since it has precedence.

file_path <- system.file("extdata", "its1_chytridiomycota_hmm.fasta", package = "metacoder")
sequences <- ape::read.FASTA(file_path)
its1_ex_data <- extract_taxonomy(sequences,
                                 regex = "^(.*)\\|(.*)\\|tax_id:(.*)\\|(.*)$",
                                 key = c("obs_id", taxon_name = "taxon_info",
                                         "taxon_id", description = "obs_info"),
                                 database = "ncbi")

The sequence headers have the format:

[1] "HQ191393_ITS1_HMM|uncultured Chytridiomycota|tax_id:175247|ITS1 located by HMM profiles,179bp length; "
[2] "HQ191391_ITS1_HMM|uncultured Chytridiomycota|tax_id:175247|ITS1 located by HMM profiles,251bp length; "
[3] "KJ464412_ITS1_HMM|Rhizophydium sphaerotheca|tax_id:109902|ITS1 located by HMM profiles,145bp length; " 
[4] "HQ191291_ITS1_HMM|uncultured Chytridiomycota|tax_id:175247|ITS1 located by HMM profiles,171bp length; "
[5] "HQ191292_ITS1_HMM|uncultured Chytridiomycota|tax_id:175247|ITS1 located by HMM profiles,178bp length; "
[6] "HQ191295_ITS1_HMM|uncultured Chytridiomycota|tax_id:175247|ITS1 located by HMM profiles,177bp length; "
set.seed(1)
heat_tree(its1_ex_data,
          node_size = n_obs,
          node_color = n_obs,
          node_label = name,
          layout = "fruchterman-reingold")

PR2 FASTA headers

The first observation in the PR2 (Guillou Bachar, et al., 2012) header is sometimes a GenBank accession number, but not always. Therefore, the best option is to use the included taxonomy information.

file_path <- system.file("extdata", "pr2_stramenopiles_gb203.fasta", package = "metacoder")
sequences <- ape::read.FASTA(file_path)
pr2_ex_data <- extract_taxonomy(sequences,
                                regex = "^(.*\\..*?)\\|(.*)$",
                                key = c("obs_id", "class"),
                                class_sep = "\\|")

The sequence headers have the format:

[1] "10-044.1.1773|Eukaryota|Stramenopiles|Stramenopiles_X|Oomycota|Oomycota_X|Oomycota_XX|Oomycota_XXX|Oomycota_XXX+sp."                                 
[2] "10-045.1.1772|Eukaryota|Stramenopiles|Stramenopiles_X|Oomycota|Oomycota_X|Oomycota_XX|Oomycota_XXX|Oomycota_XXX+sp."                                 
[3] "10-150.1.1771|Eukaryota|Stramenopiles|Stramenopiles_X|Oomycota|Oomycota_X|Oomycota_XX|Oomycota_XXX|Oomycota_XXX+sp."                                 
[4] "10-374.1.1778|Eukaryota|Stramenopiles|Stramenopiles_X|Oomycota|Oomycota_X|Oomycota_XX|Oomycota_XXX|Oomycota_XXX+sp."                                 
[5] "HQ166719.1.1035_U|Eukaryota|Stramenopiles|Ochrophyta|Bacillariophyta|Bacillariophyta_X|Polar-centric-Mediophyceae|Chaetoceros|Chaetoceros+calcitrans"
[6] "JQ781961.1.1805_U|Eukaryota|Stramenopiles|Stramenopiles_X|MAST|MAST-4|MAST-4A|MAST-4A_X|MAST-4A_X+sp."                                               
heat_tree(pr2_ex_data,
          node_size = n_obs,
          node_color = n_obs,
          node_label = name)

RDP FASTA headers

This RDP (Cole Wang, et al., 2009) FASTA file does not contain any references to taxon or sequence ids that could be used to look up more information. Instead, we will parse the taxonomy information included in the sequence headers. In this case both the rank and taxon name are supplied.

file_path <- system.file("extdata", "rdp_current_Archaea_unaligned.fa", package = "metacoder")
sequences <- ape::read.FASTA(file_path)
rdp_ex_data <- extract_taxonomy(sequences,
                                regex = "^(.*?) (.*)\\tLineage=(.*)",
                                key = c(id = "obs_info", description = "obs_info", "class"),
                                class_regex = "(.+?);(.*?);",
                                class_key = c("name", "taxon_info"))
[1] "S001191995 uncultured archaeon; LCDARCH35\tLineage=Root;rootrank;Archaea;domain;\"Euryarchaeota\";phylum;\"Methanomicrobia\";class;Methanosarcinales;order;untaxmap_Methanosarcinales;"
[2] "S002342293 uncultured archaeon; Arch-2\tLineage=Root;rootrank;Archaea;domain;\"Euryarchaeota\";phylum;untaxmap_\"Euryarchaeota\";"                                                     
[3] "S002876736 uncultured archaeon; AydatHIV_14\tLineage=Root;rootrank;Archaea;domain;\"Thaumarchaeota\";phylum;Nitrosopumilales;order;Nitrosopumilaceae;family;Nitrosopumilus;genus"      
[4] "S002934720 uncultured archaeon; A0423R002_A13\tLineage=Root;rootrank;Archaea;domain;\"Euryarchaeota\";phylum;\"Methanomicrobia\";class;untaxmap_\"Methanomicrobia\";"                  
[5] "S003785628 uncultured archaeon; PC500-1A66\tLineage=Root;rootrank;Archaea;domain;\"Crenarchaeota\";phylum;Thermoprotei;class;untaxmap_Thermoprotei;"                                   
[6] "S000511720 uncultured crenarchaeote; I3K-F4\tLineage=Root;rootrank;Archaea;domain;\"Thaumarchaeota\";phylum;Nitrosopumilales;order;Nitrosopumilaceae;family;Nitrosopumilus;genus"      

Note that the same character ( ; ) is used to delineate different ranks and taxa. This makes it the class_sep option inappropriate. If the class_sep option is not defined, each match of class_regex is considered to be a different taxa. Therefore, we can use a class_regex that matches two ; to correctly parse this awkward format.

set.seed(3)
heat_tree(rdp_ex_data,
          node_size = n_obs,
          node_color = n_obs,
          node_label = name,
          layout = "da")

SILVA FASTA headers

SILVA citep("10.1093/nar/gks1219") has a relativly simple format with an embedded taxonomy:

file_path <- system.file("extdata", "silva_nr99.fasta", package = "metacoder")
sequences <- ape::read.FASTA(file_path)
silva_ex_data <- extract_taxonomy(sequences,
                                  regex = "^(.*?) (.*)$",
                                  key = c(id = "obs_info", "class"),
                                  class_sep = ";")
[1] "AB001438.1.1662 Eukaryota;SAR;Alveolata;Dinoflagellata;Dinophyceae;Gymnodiniphycidae;Kareniaceae;Gyrodinium sp."                                         
[2] "AB006051.1.1796 Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Trebouxiophyceae;Chlorellales;Lobosphaera;Lobosphaera tirolensis"                    
[3] "FJ911802.1.1606 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Chelicerata;Arachnida;Acari;Poecilochirus sp. APGD-2010"
[4] "FJ911814.1.1608 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Chelicerata;Arachnida;Acari;Arrenoseius sp. APGD-2010"  
[5] "FJ911819.1.1607 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Chelicerata;Arachnida;Acari;Eviphis sp. 1 APGD-2010"    
[6] "FJ911835.1.1585 Eukaryota;Opisthokonta;Holozoa;Metazoa (Animalia);Eumetazoa;Bilateria;Arthropoda;Chelicerata;Arachnida;Acari;Dermanyssus hirundinis"     
set.seed(2)
heat_tree(silva_ex_data,
          node_size = n_obs,
          node_color = n_obs,
          node_label = name,
          tree_label = name)

Parsing raw characters

A list of sequence IDs

Say someone emails you a list of ncbi ids that you want to plot. You could use the following code to get the taxonomic information.

raw <- "JQ086376.1 AM946981.2 JQ182735.1 CP001396.1 J02459.1 AC150248.3 X64334.1 CP001509.3 CP006698.1 AC198536.1 JF340119.2 KF771025.1 CP007136.1 CP007133.1 U39286.1 CP006584.1 EU421722.1 U03462.1 U03459.1 AC198467.1 V00638.1 CP007394.1 CP007392.1 HG941718.1 HG813083.1 HG813082.1 CP007391.1 HG813084.1 CP002516.1 KF561236.1 JX509734.1 AP010953.1 U39285.1 M15423.1 X98613.1 CP006784.1 CP007393.1 CU928163.2 AP009240.1 CP007025.1 CP006027.1 CP003301.1 CP003289.1 CP000946.1 CP002167.1 HG428755.1 JQ086370.1 CP001846.1 CP001925.1 X99439.1 AP010958.1 CP001368.1 AE014075.1 CP002212.1 CP003034.1 CP000243.1 AY940193.1 CP004009.1 JQ182732.1 U02453.1 AY927771.1 BA000007.2 CP003109.1 CP007390.1 U02426.1 U02425.1 CP006262.1 HG738867.1 U00096.3 FN554766.1 CP001855.1 L19898.1 AE005174.2 FJ188381.1 AK157373.1 JQ182733.1 U39284.1 U37692.1 AF129072.1 FM180568.1 CP001969.1 HE616528.1 CP002729.1 JF974339.1 AB248924.1 AB248923.1 CP002291.1 X98409.1 CU928161.2 CP003297.1 FJ797950.1 CP000038.1 U82598.1 CP002211.1 JQ806764.1 U03463.1 CP001665.1"
ids <- strsplit(raw, " ")[[1]]
contaminants <- extract_taxonomy(ids, key = "obs_id", database = "ncbi")
set.seed(3)
heat_tree(contaminants,
          node_size = n_obs,
          node_color = n_obs,
          node_label = name,
          tree_label = name,
          layout = "fruchterman-reingold")

Scraping taxon names from the web

Any list of taxon names can be parsed. One way to get those names from a website that does not have an API is to parse the HTML of the website using and html parser and XPATH. Note that this example uses ITIS instead of NCBI to look up classifications for the taxa scraped from The Plant List:

library(XML)
taxon_names <- htmlTreeParse("http://www.theplantlist.org/1.1/browse/B/") %>% 
  xmlRoot() %>%
  getNodeSet("//ul[@id='nametree']/li/a/i") %>% # The string is an XPATH expression
  sapply(xmlValue)
bryophytes_ex_data <- extract_taxonomy(taxon_names, key = "name", database = "itis")
set.seed(2)
heat_tree(bryophytes_ex_data,
          node_size = n_obs,
          node_color = n_obs,
          node_label = name,
          tree_label = name,
          layout = "davidson-harel")

References

[1] J. R. Cole, Q. Wang, et al. “The Ribosomal Database Project: improved alignments and new tools for rRNA analysis”. In: Nucleic Acids Research 37.Database (Jan. 2009), pp. D141-D145. DOI: 10.1093/nar/gkn879. .

[2] L. Guillou, D. Bachar, et al. “The Protist Ribosomal Reference database (PR2): a catalog of unicellular eukaryote Small Sub-Unit rRNA sequences with curated taxonomy”. In: Nucleic Acids Research 41.D1 (Nov. 2012), pp. D597-D604. DOI: 10.1093/nar/gks1160. .

[3] U. Kõljalg, K. Larsson, et al. “UNITE: a database providing web-based methods for the molecular identification of ectomycorrhizal fungi”. In: New Phytologist 166.3 (Mar. 2005), pp. 1063-1068. DOI: 10.1111/j.1469-8137.2005.01376.x. .

[4] E. Paradis, J. Claude, et al. “APE: Analyses of Phylogenetics and Evolution in R language”. In: Bioinformatics 20.2 (Jan. 2004), pp. 289-290. DOI: 10.1093/bioinformatics/btg412. .